We'll start with raw data and work through preparing, modeling, visualizing, and analyzing it. We'll touch on the following points:
- Text processing and annotation with spaCy
- Automated phrase modeling and topic modeling (LDA) with gensim
- Visualizing topic models with pyLDAvis
- Word vector modeling with word2vec
- Visualizing word vectors with t-SNE and Bokeh
Here's how to get the dataset:
When focusing on restaurants alone, there are approximately 41K restaurants with approximately 1M user reviews written about them.
The data is provided in a handful of files in .json format. We'll be using the following files for our demo:
- yelp_academic_dataset_business.json, which contains the business records
- yelp_academic_dataset_review.json, which contains the review records
The files are text files (UTF-8) with one json object per line, each one corresponding to an individual data record. Let's take a look at a few examples.
In [1]:
import os
data_directory = os.path.join('material', 'yelp', 'source')
businesses_filepath = os.path.join(data_directory, 'yelp_academic_dataset_business.json')
with open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline()

print(first_business_record)
Only a few attributes will be of interest for this task:
- business_id: a unique identifier for each business
- categories: an array of category tags that apply to the business
Moreover, we will focus on restaurants, indicated by the presence of the Restaurants tag in the categories array.
The review records are stored in a similar manner — key, value pairs containing information about the reviews.
In [2]:
review_json_filepath = os.path.join(data_directory, 'yelp_academic_dataset_review.json')
with open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()

print(first_review_record)
The only attributes we are concerned with are:
- business_id: the identifier of the business the review is about
- text: the natural language text of the review
JSON is a handy file format for data interchange, but it's typically not the most usable format for modeling work. Let's do a bit more data preparation to get our data into a more usable shape. Our next code block will do the following:
- read each business record and convert it from a json string to a Python dict
- filter out businesses that aren't restaurants
- collect the business IDs for restaurants into a frozenset, which we'll use in the next step
In [3]:
import json
from numpy.random import choice
restaurant_ids = set()
# open the businesses file
with open(businesses_filepath, encoding='utf_8') as f:
    # iterate through each line (json record) in the file
    for business_json in f:
        # convert the json record to a Python dict
        business = json.loads(business_json)
        # if this business is not a restaurant, skip to the next one
        if business['categories'] is None or 'Restaurants' not in business['categories']:
            continue
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business['business_id'])

# choose a subset of the restaurant ids
subset = []
subset_size = 20000
for restaurant_id, _ in zip(restaurant_ids, range(subset_size)):
    subset.append(restaurant_id)

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(subset)

# print the number of unique restaurant ids in the dataset
print('{} restaurants in the dataset'.format(len(restaurant_ids)))
Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.
In [4]:
intermediate_directory = os.path.join('material', 'yelp', 'intermediate_results')
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
In [5]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    review_count = 0

    # create & open a new file in write mode
    with open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:
        # open the existing review json file
        with open(review_json_filepath, encoding='utf_8') as review_json_file:
            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)
                # if this review is not about a restaurant, skip to the next one
                if review['business_id'] not in restaurant_ids:
                    continue
                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print('Text from {} restaurant reviews written to the new txt file.'.format(review_count))

else:
    with open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass

    print('Text from {} restaurant reviews in the txt file.'.format(review_count + 1))
spaCy is a natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.
spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing and lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation
spaCy is written in optimized Cython, which means it's fast.
In [6]:
import spacy
import pandas as pd
import itertools as it
nlp = spacy.load('en')
Let's grab a sample review to play with.
In [7]:
with open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 2, 3))[0]
    sample_review = sample_review.replace('\\n', '\n')

print(sample_review)
Handing the review text to spaCy.
In [8]:
%%time
parsed_review = nlp(sample_review)
In [9]:
print(parsed_review)
Looks the same! What happened under the hood?
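Although the printed text looks identical, parsed_review is no longer a plain string: it's a spaCy Doc object that carries token-level annotations. A quick sanity check (the exact output depends on your spaCy version):

print(type(parsed_review))   # e.g., <class 'spacy.tokens.doc.Doc'>
print(len(parsed_review))    # number of tokens spaCy found in the review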
What about sentence detection and segmentation?
In [10]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')
What about named entity detection?
In [11]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')
What about part of speech tagging?
In [12]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]
pd.DataFrame(list(zip(token_text, token_pos)), columns=['token_text', 'part_of_speech'])
Out[12]:
There is much more, like:
- the token's log probability (how common it is in spaCy's training corpus)
- whether the token is a stopword
- whether it is punctuation or whitespace
- whether it looks like a number
- whether it is out of vocabulary
In [13]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: 'Yes' if x else ''))

df
Out[13]:
Phrase modeling is an approach to learning combinations of tokens that together represent meaningful multi-word concepts.
We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance.
The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\,B) - count_{min}}{count(A) \times count(B)} \times N > threshold$$

...where:
- $count(A)$ is the number of times token $A$ appears in the corpus
- $count(B)$ is the number of times token $B$ appears in the corpus
- $count(A\,B)$ is the number of times the tokens $A\,B$ appear in the corpus in order
- $N$ is the total size of the corpus vocabulary
- $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
- $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
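To make the scoring rule concrete, here's a quick calculation with made-up counts (hypothetical numbers, not taken from the Yelp corpus; the min_count and threshold values shown are gensim's defaults):

# hypothetical counts for the candidate phrase "new york"
count_a = 1000        # count(A): occurrences of "new"
count_b = 800         # count(B): occurrences of "york"
count_ab = 150        # count(A B): occurrences of "new york" as consecutive tokens
vocab_size = 100000   # N: number of distinct tokens in the corpus
min_count = 5         # gensim's default min_count
threshold = 10.0      # gensim's default threshold

score = (count_ab - min_count) / (count_a * count_b) * vocab_size
print(score, score > threshold)   # 18.125 True -> "new york" would become "new_york"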
Once our phrase model has been trained on our corpus, we can apply it to new text.
When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token (so new york would become new_york).
We will use the gensim library to help us with phrase modeling.
In [14]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phraser
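As a quick, self-contained illustration of what a trained phrase model does, here's a toy sketch with made-up sentences (not our Yelp data); the min_count and threshold values are chosen artificially low just so the tiny corpus produces a phrase:

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# a tiny made-up corpus in which "new" and "york" always appear together
toy_sentences = [['new', 'york']] * 30 + [['pizza'], ['pasta'], ['salad']] * 5

toy_model = Phrases(toy_sentences, min_count=1, threshold=0.1)
toy_phraser = Phraser(toy_model)

# the trained phraser merges the detected phrase into a single token
print(toy_phraser[['new', 'york', 'pizza', 'slice']])
# expected output (with these counts): ['new_york', 'pizza', 'slice']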
As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time:
- segment the text of the complete reviews into sentences and normalize (lemmatize) the text
- train a first-order phrase model and apply it to join common word pairs into single bigram tokens
- train a second-order phrase model on top of that and apply it to join longer phrases
- apply the full normalization and phrase pipeline to the complete review texts, removing stopwords along the way
We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.
First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to segment the reviews into individual sentences, lemmatize the text, and yield one sentence at a time, and it will do so efficiently in parallel thanks to spaCy's nlp.pipe() function.
In [15]:
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    for parsed_review in nlp.pipe(line_review(filename), batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield ' '.join([token.lemma_ for token in sent if not punct_space(token)])


def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    with open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')


def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    return token.is_punct or token.is_space
In [16]:
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
We'll write the data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.
In [17]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')
If your data is organized like this, with one sentence per line, gensim's LineSentence class provides a convenient iterator for working with other gensim components.
It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once.
This allows you to scale your modeling pipeline up to potentially very large corpora.
In [18]:
unigram_sentences = LineSentence(unigram_sentences_filepath)
Let's take a look at a few sample sentences in our new, transformed file.
In [19]:
for unigram_sentence in it.islice(unigram_sentences, 100, 110):
    print(' '.join(unigram_sentence))
    print('---')
Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "rib eye", to be linked together to form a new, single token: "rib_eye".
In [20]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')
In [21]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if False:
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)

# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

# a Phraser is smaller and faster than the full Phrases model
bigram_phraser = Phraser(bigram_model)
Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.
In [22]:
bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')
In [23]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = ' '.join(bigram_phraser[unigram_sentence])
            f.write(bigram_sentence + '\n')
In [24]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
In [25]:
for bigram_sentence in it.islice(bigram_sentences, 1480, 1500):
    print(' '.join(bigram_sentence))
    print('---')
We now see two-word phrases, such as "ice_cream" and "french_toast", linked together in the text as a single token.
Next, we'll train a second-order phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "rib eye steak" will become fully joined to "rib_eye_steak".
In [26]:
trigram_model_filepath = os.path.join(intermediate_directory, 'trigram_model_all')
In [27]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if False:
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)

# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

# a Phraser is smaller and faster than the full Phrases model
trigram_phraser = Phraser(trigram_model)
We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.
In [28]:
trigram_sentences_filepath = os.path.join(intermediate_directory, 'trigram_sentences_all.txt')
In [29]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = ' '.join(trigram_phraser[bigram_sentence])
            f.write(trigram_sentence + '\n')
In [30]:
trigram_sentences = LineSentence(trigram_sentences_filepath)
In [31]:
for trigram_sentence in it.islice(trigram_sentences, 1480, 1500):
    print(' '.join(trigram_sentence))
    print('---')
Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as "rib_eye_steak".
The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.
In addition, we'll remove stopwords at this point.
Finally, we'll write the transformed text out to a new file, with one review per line.
In [32]:
trigram_reviews_filepath = os.path.join(intermediate_directory, 'trigram_transformed_reviews_all.txt')
In [33]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_threads=4):
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]

            # apply the first-order and second-order phrase models
            bigram_review = bigram_phraser[unigram_review]
            trigram_review = trigram_phraser[bigram_review]

            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.en.STOPWORDS]

            # write the transformed review as a line in the new file
            trigram_review = ' '.join(trigram_review)
            f.write(trigram_review + '\n')
Let's grab the same review from the file with the normalized and transformed text, and compare the two.
In [34]:
print('Original:\n')

for review in it.islice(line_review(review_txt_filepath), 49, 50):
    print(review)

print('----\n')
print('Transformed:\n')

with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 49, 50):
        print(review)
Most of the grammatical structure has been removed from the text — capitalization, articles/conjunctions, punctuation, spacing, etc.
However, much of the general semantic meaning is still present.
Also, multi-word concepts such as "long_story_short" and "45_min" have been joined into single tokens, as expected.
The review text is now ready for higher-level modeling.
Topic modeling is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using Latent Dirichlet Allocation (LDA), a popular approach to topic modeling.
In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model, documents and tokens, and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
- The vectors are very high-dimensional, since there is one dimension for every token in the vocabulary.
- They are also very sparse, because any single document contains only a tiny fraction of the vocabulary.
- The dimensions are treated as independent, so the representation captures no notion of the relationship between related tokens, such as knife and fork.
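To make the token-count representation concrete, here's a toy sketch using gensim's Dictionary and doc2bow (a made-up two-document mini-corpus, not our Yelp data); we'll use the same machinery for real further below:

from gensim.corpora import Dictionary

toy_docs = [['the', 'steak', 'was', 'great'],
            ['the', 'service', 'was', 'slow']]

toy_dictionary = Dictionary(toy_docs)

print(len(toy_dictionary))                  # vocabulary size = dimensionality of each document vector
print(toy_dictionary.doc2bow(toy_docs[0]))  # sparse list of (token_id, count) pairs for the first document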
LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of tokens.
LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.
We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its LdaMulticore class.
In [35]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim
import warnings
import _pickle as pickle
The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.
In [36]:
trigram_dictionary_filepath = os.path.join(intermediate_directory, 'trigram_dict_all.dict')
In [37]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if False:
    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)

    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)

# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)
LDA uses the simplifying bag-of-words assumption: word order within a document is ignored, and only the token counts matter. Using the gensim Dictionary we just learned, we can generate a bag-of-words representation for each review; the trigram_bow_generator function below implements this. We'll save the resulting bag-of-words reviews as a matrix.
In [38]:
trigram_bow_filepath = os.path.join(intermediate_directory, 'trigram_bow_corpus_all.mm')
In [39]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)
In [40]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if False:
    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))

# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
Now we can learn our topic model from the reviews.
We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to the LdaMulticore model.
In [41]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')
In [42]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if False:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=4)

    lda.save(lda_model_filepath)

# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)
Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.
In [43]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
    print('{:20} {}\n'.format('term', 'frequency'))

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))
Interesting topics are:
0, 1, 8, 10, 11, 15, 17, 21, 23, 24, 28, 32, 35, 40, 42, 46, 48
In [44]:
explore_topic(topic_number=0)
Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic pyLDAvis library to help with that!
pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.
In [45]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')
In [46]:
#%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.display(...) displays the topic model visualization in-line in the notebook.
In [47]:
pyLDAvis.display(LDAvis_prepared)
Out[47]:
There are a lot of moving parts in the visualization. Here's a brief summary:
- On the left is the Intertopic Distance Map. Each circle is a topic; a circle's area is proportional to the topic's overall prevalence in the corpus, and similar topics tend to appear close together (inter-topic distances are projected down to two dimensions for plotting).
- On the right is a bar chart of top terms. With no topic selected, it shows the most salient terms in the corpus overall; with a topic selected, it shows that topic's top terms, with the within-topic frequency overlaid on the overall corpus frequency.
- The $\lambda$ slider controls how the terms for the selected topic are ranked, trading off a term's raw frequency within the topic against how distinctive the term is to that topic.
The interactive visualization pyLDAvis produces is helpful for both:
1. better understanding and interpreting individual topics, and
2. better understanding the relationships between the topics.
For (1), you can manually select each topic to view its top most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.
For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% Topic A, 20% Topic B, 20% Topic C, and 10% Topic D.
To use an LDA model to generate a vector representation of new text, you'll need to apply the same text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
- parsing the text with spaCy, removing punctuation and whitespace, and lemmatizing the remaining tokens
- applying the first-order and second-order phrase models to join multi-word phrases
- removing stopwords
- creating a bag-of-words representation with the gensim Dictionary
Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The lda_description(...) function will perform all these steps for us, including printing the resulting topical description of the input text.
In [54]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number + 1))[0]
In [71]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    # parse the review text with spaCy
    parsed_review = nlp(review_text)

    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]

    # apply the first-order and second-order phrase models
    bigram_review = bigram_phraser[unigram_review]
    trigram_review = trigram_phraser[bigram_review]

    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.en.STOPWORDS]

    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)

    # create an LDA representation
    review_lda = lda[review_bow]

    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda x: -x[1])

    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break

        # print the top terms for each of the highly related topics
        print('{}'.format([word for word, frq in lda.show_topic(topic_number, topn=10)]))
In [81]:
sample_review = get_sample_review(256)
print(sample_review)
In [82]:
lda_description(sample_review)
Can you complete this text snippet? If you can guess a missing word from the words that surround it, you've just demonstrated the core machine learning concept behind word vector embedding models!
Word vector models are also fully unsupervised: they learn these meanings and relationships purely by analyzing large amounts of raw, unlabeled text.
The general idea of word2vec is, for a given focus word, to use the context of the word, i.e., the other words that appear immediately before and after it, to learn a vector representation for that focus word. Words that appear in similar contexts wind up with similar vectors.
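As a rough sketch of the kind of (focus word, context word) training pairs a skip-gram word2vec model consumes (a toy illustration only; gensim's Word2Vec generates these internally, and the window size here is an arbitrary choice):

def skipgram_pairs(tokens, window=2):
    """return (focus, context) pairs for every token in a tokenized sentence"""
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

print(skipgram_pairs(['the', 'grilled', 'cheese', 'was', 'amazing']))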
Word2vec has a number of user-defined hyperparameters, including:
- size: the dimensionality of the word vectors
- window: how many words before and after the focus word count as its context
- min_count: the minimum number of occurrences a term needs in order to be included in the vocabulary
- sg: whether to use the skip-gram (sg=1) or continuous bag-of-words (sg=0) formulation
- the number of training epochs
For using word2vec in Python, gensim comes to the rescue again! It offers a highly-optimized, parallelized implementation of the word2vec algorithm with its Word2Vec class.
In [83]:
from gensim.models import Word2Vec
trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')
We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.
In [84]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if False:
    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1, 12):
        food2vec.train(trigram_sentences)
        food2vec.save(word2vec_filepath)

# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print('{} training epochs so far.'.format(food2vec.train_count))
In [85]:
print('{} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))
Let's take a look at the word vectors our model has learned.
In [87]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)
word_vectors
Out[87]:
What is the size of the wall of numbers?
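One way to answer that directly is to check the DataFrame's shape: the number of rows is the vocabulary size, and the number of columns is the 100 vector dimensions we chose.

print(word_vectors.shape)   # (number of terms in the food2vec vocabulary, 100)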
In [92]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """
    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):
        print('{:20} {}'.format(word, round(similarity, 3)))
In [99]:
get_related_terms('burger_king')
The model has learned that fast food restaurants are similar to each other! In particular, mcdonalds and wendy's are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as mcdonalds, mcdonald's and mcd's.
In [100]:
get_related_terms('soccer')
In [101]:
get_related_terms('fork')
In [103]:
get_related_terms('apple')
In [61]:
get_related_terms('happy_hour')
In [107]:
get_related_terms('tip')
In [62]:
get_related_terms('pasta', topn=20)
The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Take the vector for each word in the add list and add them together, subtracting the vectors for any words in the subtract list.
2. Find the word vectors in the vocabulary that are closest (by cosine similarity) to the resulting combined vector.
3. Return the words associated with those closest vectors.
But more generally, you can think of the vectors that represent each word as encoding some information about the meaning or concepts of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.
In [109]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)

    for term, similarity in answers:
        print(term)
In [110]:
word_algebra(add=['breakfast', 'lunch'])
In [111]:
word_algebra(add=['lunch', 'night'], subtract=['day'])
In [112]:
word_algebra(add=['taco', 'chinese'], subtract=['mexican'])
In [113]:
word_algebra(add=['bun', 'mexican'], subtract=['american'])
In [114]:
word_algebra(add=['coffee', 'snack'], subtract=['drink'])
In [115]:
word_algebra(add=['burger_king', 'pizza'])
In [118]:
word_algebra(add=['wine', 'hops'], subtract=['grapes'])
scikit-learn provides a convenient implementation of the t-SNE algorithm with its TSNE class.
In [120]:
from sklearn.manifold import TSNE
Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
- drop any stopwords that made it into the vocabulary, and
- keep only the 5,000 most frequent terms, to keep the t-SNE run manageable.
In [121]:
tsne_input = word_vectors.drop(spacy.en.STOPWORDS, errors='ignore')
tsne_input = tsne_input.head(5000)
In [123]:
tsne_input.head(10)
Out[123]:
In [124]:
tsne_filepath = os.path.join(intermediate_directory, 'tsne_model')
tsne_vectors_filepath = os.path.join(intermediate_directory, 'tsne_vectors.npy')
In [128]:
%%time
if False:
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)

    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)

with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)

tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=['x_coord', 'y_coord'])
Now we have a two-dimensional representation of our data! Let's take a look.
In [129]:
tsne_vectors.head()
Out[129]:
In [130]:
tsne_vectors['word'] = tsne_vectors.index
In [134]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
output_notebook()
In [135]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width=800,
                   plot_height=800,
                   tools=('pan, wheel_zoom, box_zoom,'
                          'box_select, resize, reset'),
                   active_scroll='wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@word'))

# draw the words as circles on the plot
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);
Whew! Let's round up the major components that we've seen:
- Text processing and annotation with spaCy
- Phrase modeling and topic modeling (LDA) with gensim
- Interactive topic model visualization with pyLDAvis
- Word vector modeling with word2vec
- Dimensionality reduction with t-SNE and interactive plotting with Bokeh
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:
- search and information retrieval
- recommendation systems
- text classification (e.g., sentiment analysis)
- document similarity and duplicate detection
...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.